Introduction

This project aims to analyze the global impact of the COVID-19 pandemic on health outcomes and socioeconomic status. By examining datasets related to COVID-19 case numbers, deaths, vaccination rates, and socioeconomic indicators such as GDP, we will explore how the pandemic has affected different population groups worldwide. The goal is to identify patterns and provide insights that could inform public health policies and economic recovery efforts.

First perspective:

Countries with higher GDP and higher vaccination rates have managed the COVID-19 pandemic more effectively, resulting in lower mortality rates and better health outcomes despite high case numbers.

Arguments:

  1. Higher GDP allows for better healthcare infrastructure and access to medical supplies:

    • Visualization 1: Cases per million by country

    • Visualization 2: Deaths per million by country

    • Visualization 3: Comparison of Total Cases and Deaths per Million by Country

    • Visualization 4: Total Cases and Deaths per Million by Country

    • Visualization 6: Comparison of GDP and Deaths per Million by Country

    • Visualization 7: Total Cases per Million by Income Category

    • Visualization 8: Tests per Thousand vs GDP per Capita (2021)

  2. Higher vaccination rates mitigate severe cases and reduce mortality:

    • Visualization 5: Excess mortality per million inhabitants vs. Total vaccinations per hundred inhabitants, per country for 2021

Second perspective:

Lower-income countries faced greater difficulties in managing the COVID-19 pandemic due to limited healthcare resources and slower vaccine distribution, leading to higher mortality rates.

Arguments:

  1. Limited healthcare infrastructure and economic instability:

    • Visualization 2: Deaths per million by country

    • Visualization 3: Comparison of Total Cases and Deaths per Million by Country

    • Visualization 4: Total Cases and Deaths per Million by Country

    • Visualization 6: Comparison of GDP and Deaths per Million by Country

    • Visualization 7: Total Cases per Million by Income Category

    • Visualization 8: Tests per Thousand vs GDP per Capita (2021)

    • Visualization 9: Tests per Thousand vs Total Cases (2021)

  2. Slower vaccine distribution:

    • Visualization 5: Excess mortality per million inhabitants vs. Total vaccinations per hundred inhabitants, per country for 2021

Dataset and preprocessing

We use two datasets: the OWID COVID-19 dataset and the "GDP per capita, PPP (in US$)" dataset. The COVID-19 dataset contains COVID-19 statistics for every country for the years 2020-2024, with variables such as “total_deaths” and “total_cases”. The second dataset contains the GDP per capita, PPP, in US$ per country per year, that is, the economic output in US dollars per inhabitant. PPP stands for purchasing power parity: the figures are adjusted for differences in purchasing power between countries, which makes comparisons fairer. The dataset is meant to give a reliable overview of the economic power of each country per year.
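To make the relationship between the two tables concrete, here is a minimal sketch on made-up numbers. It assumes the GDP table is in wide format (one row per country, one column per year, with `Country Name` and `Country Code` columns) and the COVID-19 table is in long format (one row per country per day, with an `iso_code` column), as the code further down suggests. The country names, codes, and values are invented for illustration only.

```python
import pandas as pd

# Invented miniature of the GDP table: one row per country, one column per year.
gdp = pd.DataFrame({
    "Country Name": ["Freedonia", "Sylvania"],
    "Country Code": ["AAA", "BBB"],
    "2020": [41000.0, 9000.0],
    "2021": [43000.0, 9500.0],
})

# Invented miniature of the COVID-19 table: one row per country per day.
covid = pd.DataFrame({
    "iso_code": ["AAA", "AAA", "BBB"],
    "location": ["Freedonia", "Freedonia", "Sylvania"],
    "date": ["2021-12-30", "2021-12-31", "2021-12-31"],
    "total_deaths_per_million": [100.0, 101.0, 250.0],
})

# Keep only the 2021 GDP column and give both tables the same key name.
gdp_2021 = gdp.rename(
    columns={"Country Code": "iso_code", "2021": "gdp_per_capita_ppp"}
)[["iso_code", "gdp_per_capita_ppp"]]

# Attach the 2021 GDP value to every COVID-19 row of the same country.
merged = covid.merge(gdp_2021, on="iso_code", how="inner")
print(merged)
```

Joining on the ISO code rather than on the country name avoids mismatches when the two sources spell a country name differently.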

Preprocessing:

We preprocessed these datasets by filtering them for the year 2021. For specific variables we then take the last value per country in 2021. For a few graphs we also removed some values of the location variable, namely continents, the whole world, and income categories, because these aggregates would show up as outliers in the graphs.
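The steps above can be sketched on a tiny invented table. The helper name `last_value_2021` and the shortened `AGGREGATES` list are hypothetical and only serve the illustration; the real notebook cells below apply the same idea to the full dataset.

```python
import pandas as pd

# Shortened, illustrative list of non-country rows to exclude.
AGGREGATES = ["World", "High income", "Low income", "Africa", "Asia"]

def last_value_2021(covid, column):
    """Return the last non-missing 2021 value of `column` per country."""
    covid = covid.copy()
    covid["date"] = pd.to_datetime(covid["date"])
    # Step 1: keep only rows from 2021, in chronological order.
    in_2021 = covid[covid["date"].dt.year == 2021].sort_values("date")
    # Step 2: drop continents, the world, and income categories.
    in_2021 = in_2021[~in_2021["location"].isin(AGGREGATES)]
    # Step 3: GroupBy.last() takes the last non-null value in each group.
    return in_2021.groupby("location")[column].last().reset_index()

# Invented example rows.
toy = pd.DataFrame({
    "location": ["Freedonia", "Freedonia", "Freedonia", "World", "Sylvania"],
    "date": ["2020-12-31", "2021-06-01", "2021-12-31", "2021-12-31", "2021-11-30"],
    "total_deaths_per_million": [10.0, 40.0, None, 700.0, 55.0],
})

result = last_value_2021(toy, "total_deaths_per_million")
print(result)
```

Note that a missing value on the final day of the year does not discard the country: the last reported value from earlier in 2021 is used instead.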


import plotly.express as px
import pandas as pd
import seaborn as sns

file_path = 'GDP-data.csv'
GDPdata = pd.read_csv(file_path, skiprows=4)
file_path = 'owid-covid-data.csv'
CovidData = pd.read_csv(file_path)

GDPdata = GDPdata.rename(columns={'Country Code': 'iso_code'})


CovidData['date'] = pd.to_datetime(CovidData['date'])
# Keep one snapshot per country: the cumulative totals as of 2020-12-31
CovidData = CovidData[CovidData['date'] == '2020-12-31']

# Join the dataframes on the iso_code column
df = pd.merge(GDPdata, CovidData, on='iso_code', how='inner')

fig1 = px.choropleth(
    df, 
    locations="iso_code",
    color="total_cases_per_million",
    hover_name="Country Name",
    color_continuous_scale=px.colors.sequential.Plasma,
    title="Cases per million by Country"
)

fig1.update_layout(
    geo=dict(
        showframe=False,
        showcoastlines=False,
        projection_type='equirectangular'
    ),
    height=600
)

fig1.show()

fig2 = px.choropleth(
    df, 
    locations="iso_code",
    color="total_deaths_per_million",
    hover_name="Country Name",
    color_continuous_scale=px.colors.sequential.Plasma,
    title="Deaths per million by Country"
)

fig2.update_layout(
    geo=dict(
        showframe=False,
        showcoastlines=False,
        projection_type='equirectangular'
    ),
    height=600
)

fig2.show()
import plotly.express as px
import pandas as pd
import plotly.graph_objects as go
import numpy as np
import seaborn as sns


GDPdata = GDPdata.rename(columns={'Country Code': 'iso_code'})


CovidData['date'] = pd.to_datetime(CovidData['date'])
CovidData = CovidData[CovidData['date'] == '2020-12-31']

df = pd.merge(GDPdata, CovidData, on='iso_code', how='inner')
df = df.dropna(subset=['total_cases_per_million', 'total_deaths_per_million'])

df = df[df['iso_code'] !=  'PER']

fig = px.scatter(
    df,
    x="total_cases_per_million",
    y="total_deaths_per_million",
    hover_name="Country Name",
    trendline="ols",
    title="Comparison of Total Cases and Deaths per Million by Country",
    labels={
        "total_cases_per_million": "Total Cases per Million",
        "total_deaths_per_million": "Total Deaths per Million"
    }
)

correlation = df['total_cases_per_million'].corr(df['total_deaths_per_million'])
print(correlation)

fig.update_traces(textposition='top center')
fig.update_layout(
    height=600
)

fig.show()

fig = px.scatter_geo(
    df, 
    locations="iso_code",
    size="total_cases_per_million",
    color="total_deaths_per_million",
    hover_name="Country Name",
    size_max=50,
    color_continuous_scale=px.colors.sequential.Reds,
    title="Total Cases and Deaths per Million by Country"
)

fig.update_layout(
    geo=dict(
        showframe=False,
        showcoastlines=False,
        projection_type='equirectangular'
    ),
    height=600
)

fig.show()
0.777385518505325
import pandas as pd
import plotly.express as px

df = pd.read_csv('owid-covid-data.csv')

df['date'] = pd.to_datetime(df['date'])


df_2021 = df[df['date'].dt.year == 2021].copy()  
def fill_last_available(df, col):
    return df[col].groupby(df['location']).ffill()

df_2021['excess_mortality_cumulative_per_million'] = fill_last_available(df_2021, 'excess_mortality_cumulative_per_million')
df_2021['total_vaccinations_per_hundred'] = fill_last_available(df_2021, 'total_vaccinations_per_hundred')

df_last_2021 = df_2021.groupby('location').last().reset_index()

fig = px.scatter(df_last_2021, x='total_vaccinations_per_hundred', y='excess_mortality_cumulative_per_million', 
                 trendline='ols', trendline_color_override='darkblue',
                 title='Excess mortality per million inhabitants vs. Total vaccinations per hundred inhabitants, per country for 2021',
                 labels={'total_vaccinations_per_hundred': 'Total vaccinations per hundred inhabitants',
                         'excess_mortality_cumulative_per_million': 'Excess mortality cumulative per million inhabitants'},
                 hover_name='location', opacity=0.7,
                 color_discrete_sequence=['cornflowerblue'])

fig.update_layout(xaxis=dict(range=[20, 340]))


fig.update_traces(
    line=dict(width=2, color='darkblue')
)

fig.update_layout(width=1000,
                  height=600)

fig.show()
CovidData = pd.read_csv('owid-covid-data.csv')
GDPdata = pd.read_csv('GDP-data.csv', skiprows=4)

Covid_2021 = CovidData[CovidData['date'].str.startswith('2021')]
Covid_deaths_2021 = Covid_2021.groupby('location').last()['total_deaths_per_million'].reset_index()

GDPdata = GDPdata.rename(columns={'Country Name': 'location'})

df = pd.merge(GDPdata, Covid_deaths_2021, on='location', how='inner')


df = df[df['location'] != 'World']
df = df[df['location'] != 'Upper middle income']
df = df[df['location'] != 'Lower middle income']
df = df[df['location'] !=  'High income']
df = df[df['location'] !=  'Low income']
df = df[df['location'] !=  'European Union']
df = df[df['location'] !=  'North America']
df = df[df['location'] !=  'South America']
df = df[df['location'] !=  'Asia']
df = df[df['location'] !=  'Oceania']
df = df[df['location'] !=  'Africa']


correlation = df['total_deaths_per_million'].corr(df['2021'])
print(correlation)

fig = px.scatter(
    df,
    x="2021",
    y="total_deaths_per_million",
    hover_name="location",
    trendline="ols",
    title="Comparison of GDP and Deaths per Million by Country",
    labels={
        "2021": "GDP",
        "total_deaths_per_million": "Total Deaths per million"
    }
)
fig
0.21265691224205055
CovidData = pd.read_csv('owid-covid-data.csv')
GDPdata = pd.read_csv('GDP-data.csv', skiprows=4)

Covid_2021 = CovidData[CovidData['date'].str.startswith('2021')]
Covid_deaths_2021 = Covid_2021.groupby('location').last()['total_deaths_per_million'].reset_index()

GDPdata = GDPdata.rename(columns={'Country Name': 'location'})

df = pd.merge(GDPdata, Covid_deaths_2021, on='location', how='inner')


df = df[df['location'] != 'World']
df = df[df['location'] != 'Upper middle income']
df = df[df['location'] != 'Lower middle income']
df = df[df['location'] !=  'High income']
df = df[df['location'] !=  'Low income']
df = df[df['location'] !=  'European Union']
df = df[df['location'] !=  'North America']
df = df[df['location'] !=  'South America']
df = df[df['location'] !=  'Asia']
df = df[df['location'] !=  'Oceania']
df = df[df['location'] !=  'Africa']


fig2 = px.choropleth(
    df, 
    locations="Country Code",
    color="total_deaths_per_million",
    hover_name="location",
    color_continuous_scale=px.colors.sequential.Blackbody_r,
    title="Deaths per million by Country"
)

fig2.update_layout(
    geo=dict(
        showframe=False,
        showcoastlines=False,
        projection_type='equirectangular'
    ),
    height=600
)

fig2.show()
import plotly.express as px
import plotly.graph_objects as go

# Load data
CovidData = pd.read_csv('owid-covid-data.csv')
GDPdata = pd.read_csv('GDP-data.csv', skiprows=4)

# Filter the data for the years 2020, 2021, and 2022
years = ['2020', '2021', '2022']
CovidData['year'] = CovidData['date'].str[:4]
CovidData = CovidData[CovidData['year'].isin(years)]

# Preprocess GDP data
GDPdata = GDPdata.rename(columns={'Country Name': 'location'})

# Preprocess Covid data for each year
def preprocess_covid_data(year):
    Covid_year = CovidData[CovidData['year'] == year]
    Covid_deaths_year = Covid_year.groupby('location').last()['total_deaths_per_million'].reset_index()
    df_year = pd.merge(GDPdata, Covid_deaths_year, on='location', how='inner')
    df_year = df_year[~df_year['location'].isin([
        'World', 'Upper middle income', 'Lower middle income', 'High income', 
        'Low income', 'European Union', 'North America', 'South America', 
        'Asia', 'Oceania', 'Africa', 'Peru'
    ])]
    return df_year

df_2020 = preprocess_covid_data('2020')
df_2021 = preprocess_covid_data('2021')
df_2022 = preprocess_covid_data('2022')

# Create a function to generate the choropleth map for a specific year
def create_choropleth(df, year):
    fig = px.choropleth(
        df, 
        locations="Country Code",
        color="total_deaths_per_million",
        hover_name="location",
        color_continuous_scale=px.colors.sequential.Blues_r,
        title=f"Deaths per million by Country ({year})"
    )
    fig.update_layout(
        geo=dict(
            showframe=False,
            showcoastlines=False,
            projection_type='equirectangular'
        ),
        height=600
    )
    return fig

# Generate choropleth maps for each year
fig_2020 = create_choropleth(df_2020, '2020')
fig_2021 = create_choropleth(df_2021, '2021')
fig_2022 = create_choropleth(df_2022, '2022')

# Create a figure with all traces
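fig = go.Figure(data=fig_2020.data + fig_2021.data + fig_2022.data)

# Show only the 2020 map at first; the dropdown below switches between years
for i, trace in enumerate(fig.data):
    trace.visible = (i == 0)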

# Update the layout to include dropdown buttons
fig.update_layout(
    updatemenus=[
        {
            'buttons': [
                {
                    'label': '2020',
                    'method': 'update',
                    'args': [{'visible': [True, False, False]}]
                },
                {
                    'label': '2021',
                    'method': 'update',
                    'args': [{'visible': [False, True, False]}]
                },
                {
                    'label': '2022',
                    'method': 'update',
                    'args': [{'visible': [False, False, True]}]
                }
            ],
            'direction': 'down',
            'showactive': True,
        }
    ],
    geo=dict(
        showframe=False,
        showcoastlines=False,
        projection_type='equirectangular',
    ),
    height=600
)

# Show the figure
fig.show()
import pandas as pd

# Read the datasets
covid_df = pd.read_csv('owid-covid-data.csv')
gdp_df = pd.read_csv('GDP-data.csv', skiprows=4)

# Filter data for 2021
covid_2021_df = covid_df[covid_df['date'].str.startswith('2021')].copy()

# Convert 'date' column to datetime for proper sorting and comparison
covid_2021_df['date'] = pd.to_datetime(covid_2021_df['date'])

# Find the last available date for each location in 2021
last_dates_2021 = covid_2021_df.groupby('location')['date'].idxmax()

# Filter the dataframe to only include rows with the last available date for each location
covid_last_2021 = covid_2021_df.loc[last_dates_2021]

# Keep only the columns needed for the plots
covid_deaths_cases_2021 = covid_last_2021[['location', 'continent', 'total_deaths_per_million', 'total_cases_per_million', 'median_age', 'people_vaccinated_per_hundred', 'total_tests_per_thousand']].reset_index(drop=True)

# Use the GDP data read in this cell and align its key with the COVID data
gdp_df = gdp_df.rename(columns={'Country Name': 'location'})

df = pd.merge(gdp_df, covid_deaths_cases_2021, on='location', how='inner')



fig2 = px.scatter_3d(
    df,
    x="total_cases_per_million",
    y="total_deaths_per_million",
    z="2021",
    hover_name="location",
    size_max=10,
    color='continent',
    title="Comparison of Total Cases and Deaths per Million by Country",
    labels={
        "total_cases_per_million": "Total Cases per Million",
        "total_deaths_per_million": "Total Deaths per Million",
        "2021": "GDP per capita"
    }
)
fig2.show()

fig2 = px.scatter_3d(
    df,
    x="total_cases_per_million",
    y="total_deaths_per_million",
    z="people_vaccinated_per_hundred",
    hover_name="location",
    size_max=10,
    color='continent',
    title="Comparison of Total Cases, Deaths per Million and Vaccinations by Country",
    labels={
        "total_cases_per_million": "Total Cases per Million",
        "total_deaths_per_million": "Total Deaths per Million",
        "people_vaccinated_per_hundred": "vaccinated per hundred"
    }
)
fig2.show()

fig2 = px.scatter_3d(
    df,
    x="people_vaccinated_per_hundred",
    y="total_tests_per_thousand",
    z="2021",
    hover_name="location",
    size_max=10,
    color='continent',
    title="Comparison of Vaccinations, Tests and GDP per Capita by Country",
    labels={
        "2021": "Gdp per capita 2021",
        "total_tests_per_thousand": "Total tests per thousand",
        "people_vaccinated_per_hundred": "vaccinated per hundred"
    }
)
fig2.show()

fig2 = px.scatter_3d(
    df,
    x="median_age",
    y="total_deaths_per_million",
    z="2021",
    hover_name="location",
    size_max=10,
    color='continent',
    title="Comparison of Deaths per Million, Median Age and GDP per Capita by Country",
    labels={
        "2021": "GDP per capita 2021",
        "median_age": "Median age",
        "total_deaths_per_million": "Total deaths per million"
    }
)
fig2.show()



fig = px.scatter(
    df,
    x="median_age",
    y="total_deaths_per_million",
    hover_name="location",
    trendline="ols",
    title="Comparison of Median Age and Deaths per Million by Country in 2021",
    labels={
        "median_age": "Median age",
        "total_deaths_per_million": "Total Deaths per million"
    }
)
fig.show()

fig = px.scatter(
    df,
    x="median_age",
    y="2021",
    hover_name="location",
    trendline="ols",
    title="Comparison of Median Age and GDP per Capita by Country in 2021",
    labels={
        "2021": "gdp per capita 2021",
        "median_age": "median age"
    }
)
fig.show()

df = df[df['continent'] !=  'Africa']

fig = px.scatter(
    df,
    x="2021",
    y="total_deaths_per_million",
    hover_name="location",
    trendline="ols",
    title="Comparison gdp per capita and Deaths per Million by Country in 2021",
    labels={
        "2021": "gdp per capita 2021",
        "total_deaths_per_million": "Total Deaths per million"
    }
)
fig
import pandas as pd
import plotly.express as px

# Read the datasets
covid_df = pd.read_csv('owid-covid-data.csv')
gdp_df = pd.read_csv('GDP-data.csv', skiprows=4)

# Filter data for 2021
covid_2021_df = covid_df[covid_df['date'].str.startswith('2021')].copy()

# Convert 'date' column to datetime for proper sorting and comparison
covid_2021_df['date'] = pd.to_datetime(covid_2021_df['date'])

# Find the last available date for each location in 2021
last_dates_2021 = covid_2021_df.groupby('location')['date'].idxmax()

# Filter the dataframe to only include rows with the last available date for each location
covid_last_2021 = covid_2021_df.loc[last_dates_2021]

# Keep only the columns needed for the plots
covid_deaths_cases_2021 = covid_last_2021[['location', 'continent', 'total_deaths_per_million', 'total_cases_per_million', 'median_age', 'people_vaccinated_per_hundred', 'total_tests_per_thousand']].reset_index(drop=True)

# Preprocess GDP data
gdp_df = gdp_df.rename(columns={'Country Name': 'location'})

# Merge GDP data with COVID data
df = pd.merge(gdp_df, covid_deaths_cases_2021, on='location', how='inner')

# Handle NaN values in the '2021' (GDP per capita) column
df['2021'] = df['2021'].replace('..', float('nan')).astype(float)
df = df.dropna(subset=['2021'])


# Create a bubble scatter plot
fig = px.scatter(
    df,
    x="median_age",
    y="total_deaths_per_million",
    size="2021",  # GDP per capita for bubble size
    hover_name="location",
    color='continent',
    title="Comparison of Deaths per Million, Median Age, and GDP per Capita by Country",
    labels={
        "median_age": "Median Age",
        "total_deaths_per_million": "Total Deaths per Million",
        "2021": "GDP per Capita 2021"
    },
    size_max=60,  # Maximum size of the bubbles
    color_continuous_scale=px.colors.sequential.Blues
)

# Show the figure
fig.show()

df = df.dropna(subset=['total_deaths_per_million'])

fig = px.scatter(
    df,
    x="median_age",
    y="2021", 
    size="total_deaths_per_million",
    hover_name="location",
    color='continent',
    title="Comparison of Deaths per Million, Median Age, and GDP per Capita by Country",
    labels={
        "median_age": "Median Age",
        "total_deaths_per_million": "Total Deaths per Million",
        "2021": "GDP per Capita 2021"
    },
    size_max=60,  # Maximum size of the bubbles
    color_continuous_scale=px.colors.sequential.Blues
)

# Show the figure
fig.show()
import pandas as pd
import plotly.express as px

# Read the datasets
covid_df = pd.read_csv('owid-covid-data.csv')
gdp_df = pd.read_csv('GDP-data.csv', skiprows=4)

# Convert 'date' column to datetime
covid_df['date'] = pd.to_datetime(covid_df['date'])

# Extract data for 2021 and 2020
covid_2021_df = covid_df[covid_df['date'].dt.year == 2021]
covid_2020_df = covid_df[covid_df['date'].dt.year == 2020]

# Find the last available date for each location in 2021 and 2020
last_dates_2021 = covid_2021_df.groupby('location')['date'].idxmax()
last_dates_2020 = covid_2020_df.groupby('location')['date'].idxmax()

# Filter the dataframe to only include rows with the last available date for each location
covid_last_2021 = covid_2021_df.loc[last_dates_2021]
covid_last_2020 = covid_2020_df.loc[last_dates_2020]

# Select required columns
covid_deaths_2021 = covid_last_2021[['location', 'total_deaths_per_million']]
covid_deaths_2020 = covid_last_2020[['location', 'total_deaths_per_million']]

# Merge 2020 and 2021 data on location
covid_deaths = pd.merge(covid_deaths_2021, covid_deaths_2020, on='location', suffixes=('_2021', '_2020'))

# Calculate the difference in total deaths per million between 2021 and 2020
covid_deaths['total_deaths_per_million_2021'] = covid_deaths['total_deaths_per_million_2021'] - covid_deaths['total_deaths_per_million_2020']

# Preprocess GDP data
gdp_df = gdp_df.rename(columns={'Country Name': 'location'})

# Merge GDP data with COVID data
df = pd.merge(gdp_df, covid_deaths, on='location', how='inner')

# Add other required columns from the covid_last_2021 data
other_columns = covid_last_2021[['location', 'continent', 'median_age', 'people_vaccinated_per_hundred', 'total_tests_per_thousand']]
df = pd.merge(df, other_columns, on='location', how='inner')

# Handle NaN values in the '2021' (GDP per capita) column
df['2021'] = df['2021'].replace('..', float('nan')).astype(float)
df = df.dropna(subset=['2021', 'total_deaths_per_million_2021'])

# Create the scatter plot
fig = px.scatter(
    df,
    x="median_age",
    y="2021", 
    size="total_deaths_per_million_2021",
    hover_name="location",
    color='continent',
    title="Comparison of Deaths per Million, Median Age, and GDP per Capita by Country",
    labels={
        "median_age": "Median Age",
        "total_deaths_per_million_2021": "Total Deaths per Million (2021)",
        "2021": "GDP per Capita 2021"
    },
    size_max=60,  # Maximum size of the bubbles
    color_continuous_scale=px.colors.sequential.Blues,
    height=800  # Adjust the height of the figure
)

# Show the figure
fig.show()

fig = px.scatter(
    df,
    x="median_age",
    y="total_deaths_per_million_2021",
    size="2021",  # GDP per capita for bubble size
    hover_name="location",
    color='continent',
    title="Comparison of Deaths per Million, Median Age, and GDP per Capita by Country",
    labels={
        "median_age": "Median Age",
        "total_deaths_per_million_2021": "Total Deaths per Million (2021)",
        "2021": "GDP per Capita 2021"
    },
    size_max=60,  # Maximum size of the bubbles
    color_continuous_scale=px.colors.sequential.Blues,
    height=800  # Adjust the height of the figure
)

# Show the figure
fig.show()
import pandas as pd
import plotly.graph_objs as go

covid_df = pd.read_csv('owid-covid-data.csv')
covid_2021_df = covid_df[covid_df['date'].str.startswith('2021')]

income_categories = ['High income', 'Upper middle income', 'Lower middle income', 'Low income']

filtered_df = covid_2021_df[covid_2021_df['location'].isin(income_categories)]

aggregated_df = filtered_df.groupby('location').last()['total_cases_per_million'].reset_index()

trace = go.Bar(
    x=aggregated_df['location'],
    y=aggregated_df['total_cases_per_million']
)

layout = go.Layout(
    title='Total Cases per Million by Income Category',
    xaxis=dict(title='Income Category'),
    yaxis=dict(title='Total Cases per Million')
)

fig = go.Figure(data=[trace], layout=layout)

fig.show()
import pandas as pd
import plotly.graph_objs as go

covid_df = pd.read_csv('owid-covid-data.csv')
gdp_df = pd.read_csv('GDP-data.csv', skiprows=4)

covid_2021_df = covid_df[covid_df['date'].str.startswith('2021')]

exclude_locations = ['World', 'Upper middle income', 'Lower middle income', 'High income', 'Low income',
                     'European Union', 'North America', 'South America', 'Asia', 'Oceania', 'Africa']

covid_2021_df = covid_2021_df[~covid_2021_df['location'].isin(exclude_locations)]

variables = ['total_deaths_per_million', 'total_cases_per_million', 'people_vaccinated_per_hundred', 'excess_mortality_cumulative_per_million']

last_values_dfs = {}

for var in variables:
    last_values_dfs[var] = covid_2021_df.groupby('location').last()[var].reset_index()

merged_df = last_values_dfs[variables[0]]

for var in variables[1:]:
    merged_df = pd.merge(merged_df, last_values_dfs[var], on='location', how='left')

gdp_df = gdp_df.rename(columns={"Country Name": 'location'})
gdp_df = gdp_df[['location', "2021"]]

final_merged_df = pd.merge(merged_df, gdp_df, on='location', how='inner')

final_merged_df = final_merged_df.rename(columns={"2021": "GDP_2021"})


for var in variables + ['GDP_2021']:
    final_merged_df[f'{var}_category'] = pd.qcut(final_merged_df[var], q=3, labels=['low', 'medium', 'high'])

category_orders = {
    'GDP_2021_category': ['low', 'medium', 'high'],
    'total_deaths_per_million_category': ['low', 'medium', 'high'],
    'total_cases_per_million_category': ['low', 'medium', 'high'],
    'people_vaccinated_per_hundred_category': ['low', 'medium', 'high'],
    'excess_mortality_cumulative_per_million_category': ['low', 'medium', 'high'] 
}

for var in variables + ['GDP_2021']:
    final_merged_df[f'{var}_category'] = final_merged_df[f'{var}_category'].cat.add_categories('nan').fillna('nan')

dimensions = [
    {'label': 'GDP 2021', 'values': final_merged_df['GDP_2021_category'], 'categoryorder': 'array', 'categoryarray': category_orders['GDP_2021_category']},
    {'label': 'Total Deaths per Million', 'values': final_merged_df['total_deaths_per_million_category'], 'categoryorder': 'array', 'categoryarray': category_orders['total_deaths_per_million_category']},
    {'label': 'Total Cases per Million', 'values': final_merged_df['total_cases_per_million_category'], 'categoryorder': 'array', 'categoryarray': category_orders['total_cases_per_million_category']},
    {'label': 'People Vaccinated per Hundred', 'values': final_merged_df['people_vaccinated_per_hundred_category'], 'categoryorder': 'array', 'categoryarray': category_orders['people_vaccinated_per_hundred_category']},
    {'label': 'Excess Mortality per Million', 'values': final_merged_df['excess_mortality_cumulative_per_million_category'], 'categoryorder': 'array', 'categoryarray': category_orders['excess_mortality_cumulative_per_million_category']}  # Corrected label and reference to categorical column
]

fig = go.Figure(data=[
    go.Parcats(
        dimensions=dimensions,
        line={'color': final_merged_df['GDP_2021_category'].cat.codes, 'colorscale': 'Viridis', 'showscale': False},  # Set showscale=False to hide the color scale
        hoverinfo='count+probability',
        arrangement='freeform'
    )
])

fig.update_layout(
    title='Parallel Categories Plot of COVID-19 and GDP Data',
    height=600
)

fig.show()
import pandas as pd
import plotly.express as px


covid_file_path = 'owid-covid-data.csv'
df_covid = pd.read_csv(covid_file_path)


df_covid['date'] = pd.to_datetime(df_covid['date'])
df_covid_2021 = df_covid[df_covid['date'].dt.year == 2021]


# Select the row with the latest 2021 date for each location
last_values_covid = df_covid_2021.loc[df_covid_2021.groupby('location')['date'].idxmax()].reset_index(drop=True)


filtered_data_covid = last_values_covid[['location', 'total_tests_per_thousand']].dropna()


gdp_file_path = 'GDP-data.csv'
df_gdp = pd.read_csv(gdp_file_path, skiprows=4)


df_gdp_2021 = df_gdp[['Country Name', '2021']].rename(columns={'Country Name': 'location', '2021': 'gdp_per_capita'}).dropna()


merged_data = pd.merge(filtered_data_covid, df_gdp_2021, on='location')


merged_data = merged_data[(merged_data['total_tests_per_thousand'] != 0) & (merged_data['gdp_per_capita'] != 0)]


Q1_tests = merged_data['total_tests_per_thousand'].quantile(0.25)
Q3_tests = merged_data['total_tests_per_thousand'].quantile(0.75)
IQR_tests = Q3_tests - Q1_tests

Q1_gdp = merged_data['gdp_per_capita'].quantile(0.25)
Q3_gdp = merged_data['gdp_per_capita'].quantile(0.75)
IQR_gdp = Q3_gdp - Q1_gdp


lower_bound_tests = Q1_tests - 1.5 * IQR_tests
upper_bound_tests = Q3_tests + 1.5 * IQR_tests
lower_bound_gdp = Q1_gdp - 1.5 * IQR_gdp
upper_bound_gdp = Q3_gdp + 1.5 * IQR_gdp


filtered_data_no_outliers = merged_data[
    (merged_data['total_tests_per_thousand'] >= lower_bound_tests) &
    (merged_data['total_tests_per_thousand'] <= upper_bound_tests) &
    (merged_data['gdp_per_capita'] >= lower_bound_gdp) &
    (merged_data['gdp_per_capita'] <= upper_bound_gdp)
]


fig = px.scatter(filtered_data_no_outliers, x='total_tests_per_thousand', y='gdp_per_capita',
                 title='Tests per Thousand vs GDP per Capita (2021)',
                 labels={'total_tests_per_thousand': 'Total Tests per Thousand', 'gdp_per_capita': 'GDP per Capita'},
                 trendline='ols')

fig.show()
import pandas as pd
import plotly.express as px


file_path = 'owid-covid-data.csv'
df = pd.read_csv(file_path)


df['date'] = pd.to_datetime(df['date'])
df_2021 = df[df['date'].dt.year == 2021]


# Select the row with the latest 2021 date for each location
last_values = df_2021.loc[df_2021.groupby('location')['date'].idxmax()].reset_index(drop=True)
filtered_data = last_values[['location', 'total_tests_per_thousand', 'total_cases']].dropna()


zeros_in_tests = filtered_data[filtered_data['total_tests_per_thousand'] == 0]
zeros_in_cases = filtered_data[filtered_data['total_cases'] == 0]


filtered_data = filtered_data[(filtered_data['total_tests_per_thousand'] != 0) & (filtered_data['total_cases'] != 0)]


Q1_tests = filtered_data['total_tests_per_thousand'].quantile(0.25)
Q3_tests = filtered_data['total_tests_per_thousand'].quantile(0.75)
IQR_tests = Q3_tests - Q1_tests

Q1_cases = filtered_data['total_cases'].quantile(0.25)
Q3_cases = filtered_data['total_cases'].quantile(0.75)
IQR_cases = Q3_cases - Q1_cases


lower_bound_tests = Q1_tests - 1.5 * IQR_tests
upper_bound_tests = Q3_tests + 1.5 * IQR_tests
lower_bound_cases = Q1_cases - 1.5 * IQR_cases
upper_bound_cases = Q3_cases + 1.5 * IQR_cases


filtered_data_no_outliers = filtered_data[
    (filtered_data['total_tests_per_thousand'] >= lower_bound_tests) &
    (filtered_data['total_tests_per_thousand'] <= upper_bound_tests) &
    (filtered_data['total_cases'] >= lower_bound_cases) &
    (filtered_data['total_cases'] <= upper_bound_cases)
]


fig = px.scatter(filtered_data_no_outliers, x='total_tests_per_thousand', y='total_cases',
                 title='Tests per Thousand vs Total Cases (2021)',
                 labels={'total_tests_per_thousand': 'Total Tests per Thousand', 'total_cases': 'Total Cases'},
                 trendline='ols')

fig.show()
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go

# Load data
CovidData = pd.read_csv('owid-covid-data.csv')
GDPdata = pd.read_csv('GDP-data.csv', skiprows=4)

# Filter the data for the years 2020, 2021, and 2022
years = ['2020', '2021', '2022']
CovidData['year'] = CovidData['date'].str[:4]
CovidData = CovidData[CovidData['year'].isin(years)]

# Preprocess GDP data
GDPdata = GDPdata.rename(columns={'Country Name': 'location'})

# Preprocess Covid data for each year
def preprocess_covid_data(year):
    Covid_year = CovidData[CovidData['year'] == year]
    Covid_deaths_year = Covid_year.groupby('location').last()['total_deaths_per_million'].reset_index()
    df_year = pd.merge(GDPdata, Covid_deaths_year, on='location', how='inner')
    df_year = df_year[~df_year['location'].isin([
        'World', 'Upper middle income', 'Lower middle income', 'High income', 
        'Low income', 'European Union', 'North America', 'South America', 
        'Asia', 'Oceania', 'Africa', 'Peru'
    ])]
    return df_year

# Get the cumulative deaths for each year
df_2020 = preprocess_covid_data('2020')
df_2021 = preprocess_covid_data('2021')
df_2022 = preprocess_covid_data('2022')

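# Calculate the yearly increase in deaths per million.
# Subtract the previous year's cumulative value, matched by location rather than by row index.
cum_2020 = df_2020.set_index('location')['total_deaths_per_million']
cum_2021 = df_2021.set_index('location')['total_deaths_per_million']
df_2021['total_deaths_per_million'] = df_2021['total_deaths_per_million'] - df_2021['location'].map(cum_2020)
df_2022['total_deaths_per_million'] = df_2022['total_deaths_per_million'] - df_2022['location'].map(cum_2021)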

# Create a function to generate the choropleth map for a specific year
def create_choropleth(df, year):
    fig = px.choropleth(
        df, 
        locations="Country Code",
        color="total_deaths_per_million",
        hover_name="location",
        color_continuous_scale=px.colors.sequential.Blues,
        range_color=(0, max_deaths_per_million),
        title=f"Deaths per million by Country ({year})"
    )
    fig.update_layout(
        geo=dict(
            showframe=False,
            showcoastlines=False,
            projection_type='equirectangular'
        ),
        height=600
    )
    return fig

# Get the maximum value of total_deaths_per_million for consistent color scaling
max_deaths_per_million = max(
    df_2020['total_deaths_per_million'].max(), 
    df_2021['total_deaths_per_million'].max(), 
    df_2022['total_deaths_per_million'].max()
)

# Generate choropleth maps for each year
fig_2020 = create_choropleth(df_2020, '2020')
fig_2021 = create_choropleth(df_2021, '2021')
fig_2022 = create_choropleth(df_2022, '2022')

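# Create a figure with all traces; only the 2020 map is visible until a button is pressed
fig = go.Figure(data=fig_2020.data + fig_2021.data + fig_2022.data)
for i, trace in enumerate(fig.data):
    trace.visible = (i == 0)
# Carry over the shared color axis (Blues scale, common range) from the px figures
fig.update_layout(coloraxis=fig_2020.layout.coloraxis)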

# Update the layout to include dropdown buttons
fig.update_layout(
    updatemenus=[
        {
            'buttons': [
                {
                    'label': '2020',
                    'method': 'update',
                    'args': [{'visible': [True, False, False]}]
                },
                {
                    'label': '2021',
                    'method': 'update',
                    'args': [{'visible': [False, True, False]}]
                },
                {
                    'label': '2022',
                    'method': 'update',
                    'args': [{'visible': [False, False, True]}]
                }
            ],
            'direction': 'down',
            'showactive': True,
        }
    ],
    geo=dict(
        showframe=False,
        showcoastlines=False,
        projection_type='equirectangular',
    ),
    height=600
)

# Show the figure
fig.show()
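Year-over-year differences of cumulative totals depend on how the two years are matched. When two pandas Series are subtracted, rows are paired by index label, not by country name, so frames that list different countries (or the same countries in a different order) can be paired incorrectly. The sketch below uses made-up numbers and hypothetical frame names (`end_2020`, `end_2021`) to contrast the two behaviours.

```python
import pandas as pd

# Hypothetical cumulative deaths per million at the end of each year (made-up numbers)
end_2020 = pd.DataFrame({'location': ['A', 'B', 'C'],
                         'total_deaths_per_million': [100.0, 200.0, 300.0]})
end_2021 = pd.DataFrame({'location': ['B', 'C', 'D'],
                         'total_deaths_per_million': [450.0, 500.0, 50.0]})

# Index-based subtraction pairs row 0 with row 0, i.e. country B with country A
by_position = end_2021['total_deaths_per_million'] - end_2020['total_deaths_per_million']

# Key-based subtraction looks up each location's previous-year value explicitly
previous = end_2021['location'].map(end_2020.set_index('location')['total_deaths_per_million'])
by_location = end_2021['total_deaths_per_million'] - previous

print(by_position.tolist())
print(by_location.tolist())
```

Country D has no 2020 value in this toy data, so its key-based difference is NaN instead of a number borrowed from another country.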
import pandas as pd
import plotly.express as px

# Read the datasets
covid_df = pd.read_csv('owid-covid-data.csv')
gdp_df = pd.read_csv('GDP-data.csv', skiprows=4)

# Convert 'date' column to datetime
covid_df['date'] = pd.to_datetime(covid_df['date'])

# Extract data for 2021 and 2020
covid_2021_df = covid_df[covid_df['date'].dt.year == 2021]
covid_2020_df = covid_df[covid_df['date'].dt.year == 2020]

# Find the last available date for each location in 2021 and 2020
last_dates_2021 = covid_2021_df.groupby('location')['date'].idxmax()
last_dates_2020 = covid_2020_df.groupby('location')['date'].idxmax()

# Filter the dataframe to only include rows with the last available date for each location
covid_last_2021 = covid_2021_df.loc[last_dates_2021]
covid_last_2020 = covid_2020_df.loc[last_dates_2020]

# Select required columns
covid_deaths_2021 = covid_last_2021[['location', 'total_deaths_per_million']]
covid_deaths_2020 = covid_last_2020[['location', 'total_deaths_per_million']]

# Merge 2020 and 2021 data on location
covid_deaths = pd.merge(covid_deaths_2021, covid_deaths_2020, on='location', suffixes=('_2021', '_2020'))

# Calculate the difference in total deaths per million between 2021 and 2020
covid_deaths['total_deaths_per_million_2021'] = covid_deaths['total_deaths_per_million_2021'] - covid_deaths['total_deaths_per_million_2020']

# Preprocess GDP data
gdp_df = gdp_df.rename(columns={'Country Name': 'location'})

# Merge GDP data with COVID data
df = pd.merge(gdp_df, covid_deaths, on='location', how='inner')

# Add other required columns from the covid_last_2021 data
other_columns = covid_last_2021[['location', 'continent', 'median_age', 'people_vaccinated_per_hundred', 'total_tests_per_thousand', 'human_development_index', 'population_density', 'icu_patients_per_million']]
df = pd.merge(df, other_columns, on='location', how='inner')

# Handle NaN values in the '2021' (GDP per capita) column
df['2021'] = df['2021'].replace('..', float('nan')).astype(float)
df = df.dropna(subset=['2021', 'total_deaths_per_million_2021'])

# Create the first scatter plot: Human Development Index vs Total Deaths per Million (2021)
fig1 = px.scatter(
    df,
    x="human_development_index",
    y="total_deaths_per_million_2021",
    hover_name="location",
    trendline='ols',
    title="Human Development Index vs Total Deaths per Million (2021)",
    labels={
        "human_development_index": "Human Development Index",
        "total_deaths_per_million_2021": "Total Deaths per Million (2021)"
    }
)

# Create the second scatter plot: Population Density vs Total Deaths per Million (2021)
fig2 = px.scatter(
    df,
    x="population_density",
    y="total_deaths_per_million_2021",
    hover_name="location",
    color='continent',
    title="Population Density vs Total Deaths per Million (2021)",
    labels={
        "population_density": "Population Density",
        "total_deaths_per_million_2021": "Total Deaths per Million (2021)"
    }
)

# Create the third scatter plot: ICU Patients per Million vs Total Deaths per Million (2021)
fig3 = px.scatter(
    df,
    x="icu_patients_per_million",
    y="total_deaths_per_million_2021",
    hover_name="location",
    color='continent',
    title="ICU Patients per Million vs Total Deaths per Million (2021)",
    labels={
        "icu_patients_per_million": "ICU Patients per Million",
        "total_deaths_per_million_2021": "Total Deaths per Million (2021)"
    }
)

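# Create the fourth scatter plot: Human Development Index vs GDP per capita (2021),
# with marker size showing total deaths per million
fig4 = px.scatter(
    df,
    x="human_development_index",
    y="2021",
    size="total_deaths_per_million_2021",
    hover_name="location",
    trendline='ols',
    title="Human Development Index vs GDP per Capita (2021), sized by Total Deaths per Million",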
    labels={
        "human_development_index": "Human Development Index",
        "2021": "GDP per capita (2021)"
    },
    height=600
)



# Show the figures
fig1.show()
fig2.show()
fig3.show()
fig4.show()
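The cells in this notebook pick "the last value of the year" in two ways: `groupby(...)['date'].idxmax()` followed by `.loc`, and `groupby(...).last()`. The two can differ: `idxmax` selects the row with the latest date, whereas `last()` returns the last non-null value of each column in row order. A small sketch on made-up data, assuming rows are not necessarily sorted by date:

```python
import pandas as pd

# Toy data, made up for illustration; location B is deliberately not sorted by date
toy = pd.DataFrame({
    'location': ['A', 'A', 'B', 'B'],
    'date': pd.to_datetime(['2021-12-30', '2021-12-31', '2021-12-31', '2021-06-01']),
    'total_deaths_per_million': [10.0, 11.0, 20.0, 15.0],
})

# Row with the latest date per location
last_idx = toy.groupby('location')['date'].idxmax()
last_rows = toy.loc[last_idx]

# Last non-null value per column, in the order the rows appear
order_last = toy.groupby('location').last()

print(last_rows)
print(order_last)
```

For location B the two approaches disagree in this toy example, which is why sorting by date first (or using `idxmax`) may be the safer choice when row order is not guaranteed.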
import pandas as pd
import plotly.express as px

# Read the datasets
CovidData = pd.read_csv('owid-covid-data.csv')
GDPdata = pd.read_csv('GDP-data.csv', skiprows=4)

# Filter COVID data for 2021 and get the last available data for each location
Covid_2021 = CovidData[CovidData['date'].str.startswith('2021')]
Covid_deaths_2021 = Covid_2021.groupby('location').last().reset_index()

# Preprocess GDP data
GDPdata = GDPdata.rename(columns={'Country Name': 'location'})

# Merge GDP data with COVID data
df = pd.merge(GDPdata, Covid_deaths_2021, on='location', how='inner')

# Filter out non-country entries
non_countries = ['World', 'Upper middle income', 'Lower middle income', 'High income', 'Low income', 
                 'European Union', 'North America', 'South America', 'Asia', 'Oceania', 'Africa']
df = df[~df['location'].isin(non_countries)]

# Filter for European countries only
df = df[df['continent'] == 'Europe']

# Calculate correlation
correlation = df['total_deaths_per_million'].corr(df['2021'])
print(f"Correlation between GDP and Total Deaths per Million in Europe: {correlation}")

# Create the scatter plot
fig = px.scatter(
    df,
    x="2021",
    y="total_deaths_per_million",
    hover_name="location",
    trendline="ols",
    title="Comparison of GDP and Deaths per Million by Country (Europe)",
    labels={
        "2021": "GDP per capita (2021)",
        "total_deaths_per_million": "Total Deaths per Million"
    }
)

# Show the figure
fig.show()
Correlation between GDP and Total Deaths per Million in Europe: -0.47824773806092197
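The value printed above comes from `Series.corr`, which computes a Pearson correlation by default and ignores pairs where either value is missing. The sketch below, on made-up numbers with hypothetical column names `gdp` and `deaths`, shows this behaviour and cross-checks the result against NumPy on the complete rows.

```python
import numpy as np
import pandas as pd

# Toy data, made up for illustration; the last row has a missing GDP value
toy = pd.DataFrame({
    'gdp': [10.0, 20.0, 30.0, 40.0, np.nan],
    'deaths': [400.0, 350.0, 300.0, 100.0, 250.0],
})

# Pearson correlation; the row with the missing GDP value is excluded automatically
pearson = toy['deaths'].corr(toy['gdp'])

# Cross-check with NumPy on the rows that have both values
complete = toy.dropna()
check = np.corrcoef(complete['deaths'], complete['gdp'])[0, 1]

print(pearson, check)
```

A negative coefficient only indicates a linear association in the sample; it does not by itself show that higher GDP causes fewer deaths.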